Abstract

This paper presents the results of entity information retrieval, namely retrieving the
home pages of products, organizations, and persons. The preliminary results of this
study, based on the Indri search engine, were presented at the Entity Track of TREC 2009.
Indri is an efficient and effective open-source tool that is driven by the Indri query
language and runs on any Windows or UNIX-based platform. Indri is based on the inference
network framework and supports structured queries.

Introduction 

The Entity Track, which is motivated by the Enterprise Track, was introduced for
finding the home pages of entities such as products, organizations, or persons. The
Enterprise Track provided a platform for looking at one specific entity from two
directions. The first is expert finding, which finds entities in the collection
(retrieving entities in a particular context). The second is expert profiling, which
gathers insights about entities (retrieving the context for a given entity).
Historically, an entity is something that has a home page. Therefore, persons, products,
and organizations were the three types of entities considered in the retrieval process.
Each participant was required to submit results for the given queries about persons,
products, and organizations.

For the Entity Track, the ClueWeb09 dataset, composed of 1 billion pages in 10 languages,
was used. Following the instructions provided, we used the “category B” subset, which
contains about 50 million English pages. For indexing, the Lemur Toolkit was deployed on
the Red Hat Enterprise Linux platform, and for retrieval the Indri query language was
used. The files were indexed to form two repositories. The first contains the Wikipedia
pages and the second contains all other pages. Wikipedia pages were optional and
therefore carried less importance. Indexing all the WARC files with the Indri search
engine was a time-intensive process that took nearly two days. The Indri query language
is highly efficient and effective for retrieving the indexed pages, and it supports
structured queries. We obtained better results when more pages were indexed than when
fewer pages were indexed. Using different approaches, up to four different runs
(UALRCB09r1, UALRCB09r2, UALRCB09r3, and UALRCB09r4, from the lowest priority to the
highest priority) were submitted.
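The indexing setup described above can be sketched as a small script that writes one Indri index-builder parameter file per repository. This is only an illustration under stated assumptions: the directory paths and the memory setting are hypothetical, and the Porter stemmer entry follows the stemming choice reported in the Conclusion. No stopword list is configured because the data were left unstopped.

```python
from xml.etree import ElementTree as ET


def build_index_params(index_dir, corpus_dir, memory="2G"):
    """Return the text of an Indri index-builder parameter file for WARC input.

    index_dir, corpus_dir and memory are illustrative values, not the
    settings used for the runs reported in this paper.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_dir    # where the repository is written
    ET.SubElement(root, "memory").text = memory      # indexing memory budget
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_dir  # directory holding the WARC files
    ET.SubElement(corpus, "class").text = "warc"     # ClueWeb09 is distributed as WARC
    stemmer = ET.SubElement(root, "stemmer")
    ET.SubElement(stemmer, "name").text = "porter"   # Porter stemming, no stopper
    return ET.tostring(root, encoding="unicode")


# One parameter file per repository (hypothetical paths).
wiki_params = build_index_params("/data/index/wikipedia", "/data/clueweb09/wiki")
web_params = build_index_params("/data/index/other", "/data/clueweb09/other")
```

Each returned string would be saved to a file and passed to the index builder on the command line.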
Entity Relation Finding Task


Entity related finding is defined as follows: given the name and homepage of an entity,
as well as the type of the target entity, find related entities that are of the target
type. The input contains the name and homepage of an entity, the type of the target
entity, and the context for the search. The output must contain the homepages of the
target entities and the supporting documents. An example input has the following form.

<query>
<entity_name>kimi raikkonen</entity_name>
<entity_URL>http://www.kimiraikkonen.com/</entity_URL>

<narrative>I'd like to know which organizations are sponsoring kimi</narrative>
</query>
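As a minimal sketch of how such a topic could be read before query construction, the snippet below parses the example with Python's standard XML library. The helper name `parse_topic` is hypothetical, and only the fields shown in the example are assumed to be present.

```python
import xml.etree.ElementTree as ET

# The example topic from the text, with well-formed closing tags.
TOPIC_XML = """<query>
<entity_name>kimi raikkonen</entity_name>
<entity_URL>http://www.kimiraikkonen.com/</entity_URL>
<narrative>I'd like to know which organizations are sponsoring kimi</narrative>
</query>"""


def parse_topic(xml_text):
    """Return a dict mapping each child tag of <query> to its stripped text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


topic = parse_topic(TOPIC_XML)
print(topic["entity_name"])
print(topic["narrative"])
```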

The answer to this query is extracted using the Indri search engine. The sources of the
index files for the two repositories, one containing the Wikipedia pages and the other
containing the remaining pages, were given. To obtain good scores and results, different
combinations of query terms were tried. The combinations include terms that can extract
the home page of the given entity and reject terms (such as those matching Wikipedia
pages) that do not retrieve the home page. Figure 1 shows the systematic approach to
retrieving the home page of a given entity. The search interface sits at the user end and
is accessed by supplying the related entities. The query is passed to the Indri search
engine, the index is accessed, and the required home page of the entity is retrieved
along with its score.


Figure 1: Systematic approach for the entity related finding

Base  Run  with  Simple  Query 

To index ClueWeb09, the Lemur Toolkit version 10 was installed, since the ClueWeb09
collection is in the WARC format and the latest version was designed to index WARC files
directly. UALRCB09r1 is our base run, which was built with simple queries for all the
topics run on the ClueWeb09 collection. For example, topic 1 asks for “carriers that
blackberry make phones for”. It is converted to a simple query and can be written as
#combine(carriers blackberry makes phones). The results, along with the score of each
retrieved page, are then displayed on the search interface. The “carriers of the
blackberry” task was performed first. We then issued further queries to find the home
pages of the carriers of blackberry, using different keywords contained in the required
home pages. The nDCG@R obtained for each topic is shown in Figure 2.
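The conversion from a topic narrative to a simple query can be illustrated as follows. This is a sketch, not the procedure used for the base run: the text does not say how the narrative was reduced to query terms, so the small function-word list and the helper name `simple_query` are assumptions made for illustration.

```python
import re

# Hypothetical list of function words to drop from the narrative.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "for", "that", "which", "are"}


def simple_query(narrative):
    """Wrap the content words of a narrative in an Indri #combine operator."""
    tokens = re.findall(r"[a-z0-9]+", narrative.lower())
    terms = [t for t in tokens if t not in FUNCTION_WORDS]
    return "#combine(" + " ".join(terms) + ")"


print(simple_query("Carriers that Blackberry makes phones for"))
# prints: #combine(carriers blackberry makes phones)
```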


Figure 2: Graph of nDCG@R for UALRCB09r1 (x-axis: topic numbers)

Run  with  Complex  Queries 

The remaining three runs used complex queries. These runs also retrieved pages from the
ClueWeb09 collection indexed with Lemur. In addition, the queries were defined according
to the output required. For instance, all pages including the Wikipedia pages were first
retrieved, and the compiled list was then filtered via the query to eliminate the
Wikipedia pages (since the Wikipedia pages were optional). At least 10 pages were
retrieved for every query; the first two were taken as the primary pages and the rest as
the supporting documents. To give an idea of the complex structured queries, consider the
following example.

#weight(0.8 #filrej(Wikipedia #combine(carriers blackberry makes phones))
        0.1 #combine(#1(carriers blackberry) #1(blackberry phones) home page)
        0.1 #combine(#uw8(blackberry carriers home page)))
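A query of this shape can be assembled programmatically. The sketch below combines three components under #weight: a bag-of-words component wrapped in #filrej so that documents matching the reject term are filtered out, an exact-phrase component built with #1, and an unordered-window component built with #uw8. The function name `complex_query` and its parameters are hypothetical; the operator names and the default weights are taken from the example above.

```python
def complex_query(terms, phrases, window_terms,
                  reject="Wikipedia", weights=(0.8, 0.1, 0.1)):
    """Assemble a weighted Indri query from three components.

    terms        -- words for the bag-of-words component
    phrases      -- word tuples, each matched as an exact phrase with #1
    window_terms -- words that must co-occur within an 8-term unordered window
    reject       -- term whose matching documents #filrej filters out
    """
    bag = "#filrej({} #combine({}))".format(reject, " ".join(terms))
    exact = " ".join("#1({})".format(" ".join(p)) for p in phrases)
    phrase = "#combine({} home page)".format(exact)
    window = "#combine(#uw8({}))".format(" ".join(window_terms))
    w_bag, w_phrase, w_window = weights
    return "#weight({} {} {} {} {} {})".format(
        w_bag, bag, w_phrase, phrase, w_window, window)


query = complex_query(
    terms=["carriers", "blackberry", "makes", "phones"],
    phrases=[("carriers", "blackberry"), ("blackberry", "phones")],
    window_terms=["blackberry", "carriers", "home", "page"],
)
print(query)
```

Building the string from parts keeps the parentheses balanced, which is easy to get wrong when such queries are written by hand.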

All the tasks were performed once again using the complex queries, and the outcome is
called UALRCB09r2. UALRCB09r2 gave less effective results than UALRCB09r1. The nDCG@R of
UALRCB09r2 for all the topics is shown in Figure 3.
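The post-processing described earlier in this section (drop Wikipedia pages, then treat the first two remaining pages as primary and the rest as supporting documents) can be sketched as below. The function name, the host-based Wikipedia test, and the example URLs are assumptions for illustration; in the runs themselves the Wikipedia pages were rejected inside the query.

```python
from urllib.parse import urlparse


def split_results(ranked_urls, n_primary=2):
    """Drop Wikipedia pages, then split the ranking into primary and supporting pages."""
    kept = [
        url for url in ranked_urls
        if not (urlparse(url).hostname or "").endswith("wikipedia.org")
    ]
    return kept[:n_primary], kept[n_primary:]


# Hypothetical ranked list for one topic.
ranked = [
    "http://carrier-one.example/",
    "http://en.wikipedia.org/wiki/BlackBerry",
    "http://carrier-two.example/",
    "http://news.example/blackberry-carriers",
]
primary, supporting = split_results(ranked)
```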
Figure 3: Graph of nDCG@R for UALRCB09r2

UALRCB09r2 was performed with the complex queries; for UALRCB09r3, a new approach with a
different query structure was considered. After adjusting the precision, better results
were obtained than with UALRCB09r2. Finally, the last run, called UALRCB09r4, was
performed as a small refinement to see whether the outcome could be further improved.
Figure 4 shows the nDCG@R of UALRCB09r3 for all of the topics.


Figure 4: Graph of nDCG@R for UALRCB09r3 (x-axis: topic numbers)


Results 

The results obtained for the Entity Track are tabulated in Table 1 for all four runs,
UALRCB09r1 through UALRCB09r4. The experiments conducted in this investigation not only
shed light on the task undertaken this year but also set a reliable foundation for
upcoming studies in this area. The nDCG@R for all the runs is shown in Figure 5.
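For readers unfamiliar with the measure plotted in the figures, nDCG@R can be computed as in the sketch below, where R is the number of relevant items judged for a topic. The log2 discount is the common textbook form and the gain values in the usage lines are hypothetical; the exact gains and discount used by the track's evaluation are not stated in this paper, so both are assumptions.

```python
import math


def ndcg_at_r(ranked_gains, judged_gains):
    """nDCG at rank R, where R is the number of judged items with positive gain.

    ranked_gains -- gains of the returned items, in rank order (a list)
    judged_gains -- gains of all judged items for the topic (a list)
    """
    ideal = sorted((g for g in judged_gains if g > 0), reverse=True)
    r = len(ideal)
    if r == 0:
        return 0.0

    def dcg(gains):
        # Rank i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:r]))

    return dcg(ranked_gains) / dcg(ideal)


# Hypothetical gains: 2 for a home page, 1 for a relevant non-home page.
perfect = ndcg_at_r([2, 1], [2, 1])
swapped = ndcg_at_r([1, 2], [2, 1])
missed = ndcg_at_r([0, 0], [2, 1])
```

A run that returns the judged items in ideal order scores 1, and a run that returns none of them within the first R ranks scores 0.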

Table 1: Results for all the twenty topics


Topic   # Pri   NDCG at R (Best / Median / Worst)
1       10      0.2992 / 0.0597 / 0.0000
Figure 5: Graph of nDCG@R for all the runs

Conclusion 

In this entity retrieval study we investigated how the Indri search engine performs with
different queries when retrieving the required entities in noisy web environments. Four
official runs were analyzed to show how successfully the home pages were retrieved.
Positive results were obtained by using the complex queries. For retrieval on ClueWeb09
we used Porter stemming on unstopped data. The Indri search engine is both efficient and
effective on such large-scale collections. Indri indexed the 50 million documents of the
ClueWeb09 collection (5 TB uncompressed) in 2 days and processed approximately one query
every second. In terms of effectiveness, phrase expansion via Indri's structured query
operators proved to be a powerful asset. Despite all of this, we hope to improve our
system for next year. There are a number of things we aim to explore, including faster
indexing, improved query processing times, further use of complex queries, and more
effective query expansion techniques for noisy data.